Preliminaries

We're going to build and compare a few malware machine learning models in this series of Jupyter notebooks. Some of them require a GPU. I've used a Titan X GPU for this exercise. If yours isn't as beefy, you may get TensorFlow memory errors that require modifying some of the code, namely file_chunks and file_chunk_size. (I'll point to it later.) But, to get started, the first few exercises will work on even that GPU you're embarrassed to tell people about, or, if you're willing to wait, no GPU at all.

For the fancy folks who have multiple GPUs, we're going to restrict usage to the first one.


In [1]:
# limit GPU usage, if any, to this GPU
%env CUDA_VISIBLE_DEVICES=0


env: CUDA_VISIBLE_DEVICES=0

Also note that this exercise assumes you've already populated a malicious/ and a benign/ directory with samples that you consider malicious and benign, respectively. How many samples? In this notebook, I'm using 50K of each for demonstration purposes. Sadly, you must bring your own. If you don't populate these subdirectories with binaries (each renamed to the sha256 hash of its contents!), the code will bicker and complain incessantly.
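
If you want to sanity-check that your layout matches what the code expects, a quick check like the following will do. (check_sample_dir is a throwaway helper for illustration, not part of classifier/.)

import hashlib
import os

def check_sample_dir(path):
    # return filenames whose sha256 digest doesn't match their name
    bad = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        with open(full, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != name.lower():
            bad.append(name)
    return bad

for d in ('malicious', 'benign'):
    print(d, ':', len(check_sample_dir(d)), 'misnamed files')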

Feature extraction for feature-based models

There is a lot of domain knowledge on what malware authors can do, and what malware authors actually do when crafting malicious files. Furthermore, there are some things malware authors seldom do that would indicate that a file is benign. For each file we want to analyze, we're going to encapsulate that domain knowledge about malicious and benign files in a single feature vector. See the source code at classifier/pefeatures.py.
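
To make "encapsulate that domain knowledge in a single feature vector" concrete, here's a minimal sketch of the idea (not the actual pefeatures.py code): a couple of simple feature groups computed from raw bytes, concatenated into one fixed-length vector. The real extractor does the same thing with many more, and much smarter, groups.

import numpy as np

def byte_histogram(raw_bytes):
    # normalized distribution of byte values: 256 features
    counts = np.bincount(np.frombuffer(raw_bytes, dtype=np.uint8), minlength=256)
    return counts / max(len(raw_bytes), 1)

def file_summary_features(raw_bytes):
    # a couple of whole-file summary features: length and byte entropy
    hist = byte_histogram(raw_bytes)
    entropy = -np.sum(hist[hist > 0] * np.log2(hist[hist > 0]))
    return np.array([len(raw_bytes), entropy])

def feature_vector(raw_bytes):
    # concatenate the groups into one fixed-length vector
    return np.concatenate([byte_histogram(raw_bytes), file_summary_features(raw_bytes)])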

Note that the feature extraction we use here contains many elements from published malware classification papers. Some of those are slightly modified. And there are additional features in this particular feature extraction that are included because, well, they were just sitting there in the LIEF parser patiently waiting for a chair at the feature vector table. Read: there's really no secret sauce in there, and to turn this into something commercially viable would take a bit of work. But, be my guest.

A note about LIEF. What a cool tool with a great mission! It aims to parse and manipulate binary files for Windows (PE), Linux (ELF) and macOS (Mach-O). Of course, we're using only the PE subset here. At the time of this writing, LIEF is still very much a new tool, and I've worked with the authors to help resolve some kinks. It's a growing project with more warts to find and fix. Nevertheless, we're using it as the backbone for any feature that requires parsing the PE file.
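
For a flavor of what parsing with LIEF looks like, here's a minimal sketch. The field choices are illustrative rather than the ones pefeatures.py actually uses, and error handling for malformed files varies by LIEF version.

import lief

def pe_header_features(filename):
    # parsing malformed or packed files may fail; handle that in real code
    binary = lief.parse(filename)
    return {
        'numberof_sections': binary.header.numberof_sections,
        'sizeof_code': binary.optional_header.sizeof_code,
        'entrypoint': binary.entrypoint,
        'max_section_entropy': max((s.entropy for s in binary.sections), default=0.0),
    }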


In [2]:
from classifier import common

In [3]:
# this will take a LONG time the first time you run it (and cache features to disk for next time)
# it's also chatty.  Parts of feature extraction require LIEF, and LIEF is quite chatty.
# the output you see below is *after* I've already run feature extraction, so that
#   X, y and sha256list are being read from cache on disk
X, y, sha256list = common.extract_features_and_persist() 

# split our features, labels and hashes into training and test sets
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(123)
X_train, X_test, y_train, y_test, sha256_train, sha256_test = train_test_split( X, y, sha256list, test_size=1000) 
# a random train_test split, but for a malware classifier, we should really be holding out *future* malicious and benign 
# samples, to better capture how we'll generalize to malware yet to be seen in the wild. ...an exercise left to the reader (a sketch of a temporal split follows below).


took 3.0163469314575195 seconds
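
About that comment on holding out *future* samples: assuming you had a first-seen timestamp for each sha256 (not something extract_features_and_persist provides -- you'd have to bring it from your sample feed), a temporal split might look roughly like this.

import numpy as np

def temporal_split(X, y, sha256list, first_seen, holdout_fraction=0.1):
    # first_seen is a hypothetical dict: sha256 -> first-seen timestamp
    # train on the oldest samples, hold out the newest ones for testing
    y = np.asarray(y)
    order = np.argsort([first_seen[s] for s in sha256list])
    cutoff = int(len(order) * (1 - holdout_fraction))
    train_idx, test_idx = order[:cutoff], order[cutoff:]
    return (X[train_idx], X[test_idx], y[train_idx], y[test_idx],
            [sha256list[i] for i in train_idx], [sha256list[i] for i in test_idx])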

Multilayer perceptron

We'll use the features we extracted to train a multilayer perceptron (MLP). An MLP is an artificial neural network with at least one hidden layer. Is a multilayer perceptron "deep learning"? Well, it's a matter of semantics, but "deep learning" may imply that the features and model are optimized together, end-to-end. So, in that sense, no: since we're using domain knowledge to extract features, then passing them to an artificial neural network, we'll remain conservative and call this an MLP. (As we'll see, don't be fooled just because we're not calling this "deep learning": this MLP is no slouch.) The network architecture is defined in classifier/simple_multilayer.py.
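
For orientation, here's roughly what a create_model like the one in classifier/simple_multilayer.py could look like in Keras. This is a sketch with the same knobs as the call below, not a copy of that file.

from keras.models import Sequential
from keras.layers import Dense, Dropout

def create_model_sketch(input_shape, input_dropout, hidden_dropout, hidden_layers):
    model = Sequential()
    model.add(Dropout(input_dropout, input_shape=input_shape))  # randomly zero some input features
    for width in hidden_layers:
        model.add(Dense(width, activation='relu'))
        model.add(Dropout(hidden_dropout))                      # randomly zero some hidden units
    model.add(Dense(1, activation='sigmoid'))                   # output: P(malicious)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model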


In [4]:
# StandardScaling the data can be important for a multilayer perceptron
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)

# Note that we're using scaling info from X_train to transform both the train and test sets
X_train = scaler.transform(X_train) # scale for multilayer perceptron
X_test = scaler.transform(X_test)

from classifier import simple_multilayer
from keras.callbacks import LearningRateScheduler, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
model = simple_multilayer.create_model(
    input_shape=(X_train.shape[1], ),     # input dimensions
    input_dropout=0.05,                   # this prevents the model becoming a fanboy of (overfitting to) any particular input feature
    hidden_dropout=0.1,                   # same, but for hidden units.  Dropping out hidden layers can create a sort of ensemble learner
    hidden_layers=[4096, 2048, 1024, 512] # this is "art". making up # of hidden layers and width of each. don't be afraid to change this
)
model.fit(X_train, y_train,
          batch_size=128,
          epochs=200,
          verbose=1,
          callbacks=[
              EarlyStopping( patience=20 ),
              ModelCheckpoint( 'multilayer.h5', save_best_only=True),
              ReduceLROnPlateau( patience=5, verbose=1)],
          validation_data=(X_test, y_test))

from keras.models import load_model
# we'll load the "best" model (the one with the lowest validation loss) that was saved
# by our ModelCheckPoint callback
model = load_model('multilayer.h5')

y_pred = model.predict(X_test)
common.summarize_performance(y_pred, y_test, "Multilayer perceptron") 
# The astute reader will note we should be doing this on a separate holdout, since we've explicitly
# saved the model that works best on X_test, y_test...an exercise left for the reader...


Using TensorFlow backend.
Train on 98997 samples, validate on 1000 samples
Epoch 1/200
98997/98997 [==============================] - 17s - loss: 0.2164 - acc: 0.9148 - val_loss: 0.1322 - val_acc: 0.9470
Epoch 2/200
98997/98997 [==============================] - 11s - loss: 0.1541 - acc: 0.9408 - val_loss: 0.1200 - val_acc: 0.9500
Epoch 3/200
98997/98997 [==============================] - 12s - loss: 0.1344 - acc: 0.9485 - val_loss: 0.1143 - val_acc: 0.9500
Epoch 4/200
98997/98997 [==============================] - 11s - loss: 0.1227 - acc: 0.9531 - val_loss: 0.1147 - val_acc: 0.9510
Epoch 5/200
98997/98997 [==============================] - 11s - loss: 0.1150 - acc: 0.9559 - val_loss: 0.1097 - val_acc: 0.9580
Epoch 6/200
98997/98997 [==============================] - 11s - loss: 0.1061 - acc: 0.9601 - val_loss: 0.1078 - val_acc: 0.9590
Epoch 7/200
98997/98997 [==============================] - 11s - loss: 0.1015 - acc: 0.9611 - val_loss: 0.1158 - val_acc: 0.9580
Epoch 8/200
98997/98997 [==============================] - 11s - loss: 0.0949 - acc: 0.9637 - val_loss: 0.1053 - val_acc: 0.9610
Epoch 9/200
98997/98997 [==============================] - 11s - loss: 0.0929 - acc: 0.9640 - val_loss: 0.1111 - val_acc: 0.9620
Epoch 10/200
98997/98997 [==============================] - 11s - loss: 0.0886 - acc: 0.9663 - val_loss: 0.1055 - val_acc: 0.9610
Epoch 11/200
98997/98997 [==============================] - 11s - loss: 0.0862 - acc: 0.9669 - val_loss: 0.1048 - val_acc: 0.9600
Epoch 12/200
98997/98997 [==============================] - 11s - loss: 0.0819 - acc: 0.9684 - val_loss: 0.1049 - val_acc: 0.9620
Epoch 13/200
98997/98997 [==============================] - 11s - loss: 0.0810 - acc: 0.9691 - val_loss: 0.1077 - val_acc: 0.9640
Epoch 14/200
98997/98997 [==============================] - 11s - loss: 0.0784 - acc: 0.9703 - val_loss: 0.0999 - val_acc: 0.9620
Epoch 15/200
98997/98997 [==============================] - 11s - loss: 0.0757 - acc: 0.9713 - val_loss: 0.1065 - val_acc: 0.9650
Epoch 16/200
98997/98997 [==============================] - 11s - loss: 0.0723 - acc: 0.9727 - val_loss: 0.1069 - val_acc: 0.9650
Epoch 17/200
98997/98997 [==============================] - 11s - loss: 0.0709 - acc: 0.9732 - val_loss: 0.1099 - val_acc: 0.9670
Epoch 18/200
98997/98997 [==============================] - 11s - loss: 0.0699 - acc: 0.9732 - val_loss: 0.1046 - val_acc: 0.9680
Epoch 19/200
98997/98997 [==============================] - 11s - loss: 0.0688 - acc: 0.9738 - val_loss: 0.1051 - val_acc: 0.9650
Epoch 20/200
98944/98997 [============================>.] - ETA: 0s - loss: 0.0673 - acc: 0.9746
Epoch 00019: reducing learning rate to 0.0009999999776482583.
98997/98997 [==============================] - 11s - loss: 0.0672 - acc: 0.9747 - val_loss: 0.1020 - val_acc: 0.9640
Epoch 21/200
98997/98997 [==============================] - 11s - loss: 0.0616 - acc: 0.9761 - val_loss: 0.1001 - val_acc: 0.9660
Epoch 22/200
98997/98997 [==============================] - 11s - loss: 0.0606 - acc: 0.9767 - val_loss: 0.1006 - val_acc: 0.9670
Epoch 23/200
98997/98997 [==============================] - 11s - loss: 0.0611 - acc: 0.9763 - val_loss: 0.1030 - val_acc: 0.9670
Epoch 24/200
98997/98997 [==============================] - 11s - loss: 0.0619 - acc: 0.9764 - val_loss: 0.1014 - val_acc: 0.9680
Epoch 25/200
98944/98997 [============================>.] - ETA: 0s - loss: 0.0599 - acc: 0.9772
Epoch 00024: reducing learning rate to 9.999999310821295e-05.
98997/98997 [==============================] - 11s - loss: 0.0599 - acc: 0.9772 - val_loss: 0.1024 - val_acc: 0.9670
Epoch 26/200
98997/98997 [==============================] - 11s - loss: 0.0601 - acc: 0.9767 - val_loss: 0.1016 - val_acc: 0.9670
Epoch 27/200
98997/98997 [==============================] - 11s - loss: 0.0594 - acc: 0.9773 - val_loss: 0.1017 - val_acc: 0.9670
Epoch 28/200
98997/98997 [==============================] - 11s - loss: 0.0588 - acc: 0.9777 - val_loss: 0.1019 - val_acc: 0.9670
Epoch 29/200
98997/98997 [==============================] - 11s - loss: 0.0602 - acc: 0.9772 - val_loss: 0.1017 - val_acc: 0.9670
Epoch 30/200
98944/98997 [============================>.] - ETA: 0s - loss: 0.0595 - acc: 0.9774
Epoch 00029: reducing learning rate to 9.999999019782991e-06.
98997/98997 [==============================] - 11s - loss: 0.0595 - acc: 0.9774 - val_loss: 0.1022 - val_acc: 0.9670
Epoch 31/200
98997/98997 [==============================] - 11s - loss: 0.0587 - acc: 0.9776 - val_loss: 0.1028 - val_acc: 0.9670
Epoch 32/200
98997/98997 [==============================] - 11s - loss: 0.0594 - acc: 0.9774 - val_loss: 0.1028 - val_acc: 0.9670
Epoch 33/200
98997/98997 [==============================] - 11s - loss: 0.0604 - acc: 0.9773 - val_loss: 0.1020 - val_acc: 0.9670
Epoch 34/200
98997/98997 [==============================] - 11s - loss: 0.0596 - acc: 0.9769 - val_loss: 0.1014 - val_acc: 0.9670
Epoch 35/200
98816/98997 [============================>.] - ETA: 0s - loss: 0.0607 - acc: 0.9768
Epoch 00034: reducing learning rate to 9.99999883788405e-07.
98997/98997 [==============================] - 11s - loss: 0.0607 - acc: 0.9769 - val_loss: 0.1023 - val_acc: 0.9670
** Multilayer perceptron **
ROC AUC = 0.993257095265552
threshold=0.8988195061683655: 0.9070247933884298 TP rate @ 0.009689922480620155 FP rate
confusion matrix @ threshold:
[[511   5]
 [ 46 438]]
accuracy @ threshold = 0.949
Out[4]:
(0.99325709526555195,
 0.89881951,
 0.0096899224806201549,
 0.90702479338842978,
 array([[511,   5],
        [ 46, 438]]),
 0.94899999999999995)
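
A note on reading that summary: summarize_performance reports the ROC AUC and then picks an operating threshold at (roughly) a 1% false positive rate, reporting the true positive rate and confusion matrix at that threshold. The idea, sketched with scikit-learn (not necessarily the exact implementation in classifier/common.py):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def threshold_at_fpr(y_true, y_score, target_fpr=0.01):
    auc = roc_auc_score(y_true, y_score)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # last ROC point whose false positive rate is still <= the target
    idx = np.searchsorted(fpr, target_fpr, side='right') - 1
    return auc, thresholds[idx], tpr[idx], fpr[idx]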

Sanity check: random forest classifier

Alright. Is that good? Let's compare to another model: we'll reach for the simple and reliable random forest classifier.

One nice thing about tree-based classifiers like a random forest is that they are invariant to linear scaling and shifting of the features (the split thresholds simply adapt to the transformed values). Nevertheless, for a sanity check, we're going to feed it the same scaled/transformed features.


In [5]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# you can increase performance by increasing n_estimators and removing the restriction on max_depth
# I've kept those in there because I want a quick-and-dirty look at how the MLP above compares
rf = RandomForestClassifier( 
    n_estimators=40, 
    n_jobs=-1, 
    max_depth=30
).fit(X_train, y_train)

y_pred = rf.predict_proba(X_test)[:,-1] # get probability of malicious (last class == last column)
_ = common.summarize_performance(y_pred, y_test, "RF Classifier")


** RF Classifier **
ROC AUC = 0.9944763437760266
threshold=0.7: 0.9276859504132231 TP rate @ 0.009689922480620155 FP rate
confusion matrix @ threshold:
[[512   4]
 [ 36 448]]
accuracy @ threshold = 0.96

How can we improve?

It's not a terrible model, but it's nothing special. We'd really like to get to the realm of > 99% true positive rate at < 1% false positive rate.

Seems like we can do one of two things here:

  1. Spend some time working on our dataset, our labels, and our feature extraction, but use the same model.
  2. Make our model special. Really special.

Hey, end-to-end deep learning disrupted object detection, image recognition, speech recognition and machine translation. And that sounds way more interesting than item 1, so let's pull out some end-to-end deep learning for static malware detection!